Clustering with Spectral Norm and the k-means Algorithm



Amit Kumar*
Dept. of Computer Science and Engg., IIT Delhi, New Delhi
email: amitk@cse.iitd.ac.in

Ravindran Kannan
Microsoft Research India Lab., Bangalore
email: kannan@microsoft.com

April 13, 2010 



Abstract 

There has been much progress on efficient algorithms for clustering data points generated by a mixture of $k$ probability distributions under the assumption that the means of the distributions are well-separated, i.e., the distance between the means of any two distributions is at least $\Omega(k)$ standard deviations. These results generally make heavy use of the generative model and particular properties of the distributions. In this paper, we show that a simple clustering algorithm works without assuming any generative (probabilistic) model. Our only assumption is what we call a "proximity condition": the projection of any data point onto the line joining its cluster center to any other cluster center is $\Omega(k)$ standard deviations closer to its own center than the other center. Here the notion of standard deviations is based on the spectral norm of the matrix whose rows represent the difference between a point and the mean of the cluster to which it belongs. We show that in the generative models studied, our proximity condition is satisfied and so we are able to derive most known results for generative models as corollaries of our main result. We also prove some new results for generative models - e.g., we can cluster all but a small fraction of points only assuming a bound on the variance. Our algorithm relies on the well known $k$-means algorithm, and along the way, we prove a result of independent interest - that the $k$-means algorithm converges to the "true centers" even in the presence of spurious points provided the initial (estimated) centers are close enough to the corresponding actual centers and all but a small fraction of the points satisfy the proximity condition. Finally, we present a new technique for boosting the ratio of inter-center separation to standard deviation. This allows us to prove results for learning mixtures of a class of distributions under weaker separation conditions.



*This work was done while the author was visiting Microsoft Research India Lab.



1 Introduction 



Clustering is in general a hard problem. But, there has been a lot of research (see Section [3] for references) 
on proving that if we have data points generated by a mixture of k probability distributions, then one can 
cluster the data points into the k clusters, one corresponding to each component, provided the means of the 
different components are well-separated. There are different notions of well-separated, but mainly, the (best 
known) results can be qualitatively stated as:

"If the means of every pair of densities are at least poly($k$) times standard deviations apart, then we can learn the mixture in polynomial time."
These results generally make heavy use of the generative model and particular properties of the distributions 
(Indeed, many of them specialize to Gaussians or independent Bernoulli trials). In this paper, we make 
no assumptions on the generative model of the data. We are still able to derive essentially the same result 
(loosely stated for now as):

"If the projection of any data point onto the line joining its cluster center to any other cluster center is $\Omega(k)$ times standard deviations closer to its own center than the other center (we call this the "proximity condition"), then we can cluster correctly in polynomial time."
First, if the $n$ points to be clustered form the rows of an $n \times d$ matrix $A$ and $C$ is the corresponding matrix of cluster centers (so each row of $C$ is one of $k$ vectors, namely the centers of $k$ clusters), then note that the maximum directional variance (no probabilities here, the variance is just the average squared distance from the center) of the data in any direction is just

$$\frac{1}{n} \cdot \max_{v : |v| = 1} |(A - C) \cdot v|^2 = \frac{\|A - C\|^2}{n},$$

where $\|A - C\|$ is the spectral norm. So, the spectral norm scaled by $1/\sqrt{n}$ will play the role of standard deviation in the above assertion. To our knowledge, this is the first result proving that clustering can be done in polynomial time in a general situation with only deterministic assumptions. It settles an open question raised in [KV09].
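
The identity between maximum directional variance and the scaled spectral norm can be checked numerically. The following is a minimal sketch, assuming NumPy; the synthetic data, the clustering `labels`, and the helper names are illustrative assumptions and not part of the paper.

```python
import numpy as np

def spectral_variance(A, C):
    """Return ||A - C||^2 / n, with ||.|| the spectral norm (largest singular value)."""
    return np.linalg.norm(A - C, ord=2) ** 2 / A.shape[0]

def directional_variance(A, C, v):
    """Average squared projection of the rows of A - C onto the direction v."""
    v = v / np.linalg.norm(v)
    return float(np.sum(((A - C) @ v) ** 2) / A.shape[0])

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 50)           # hypothetical clustering of 150 points
A = 10.0 * rng.normal(size=(3, 5))[labels] + rng.normal(size=(150, 5))
means = np.vstack([A[labels == r].mean(axis=0) for r in range(3)])
C = means[labels]                              # row i is the center of point i's cluster

top_direction = np.linalg.svd(A - C)[2][0]     # top right singular vector of A - C
random_directions = rng.normal(size=(100, 5))
```

The top right singular vector of $A - C$ attains the maximum, and no other direction exceeds it.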

We will show that in the generative models studied, our proximity condition is satisfied and so we are 
able to derive all known results for generative models as corollaries of our theorem (with one qualification: 
whereas our separation is in terms of the whole data variance, often, in the case of Gaussians, one can make 
do with separations depending only on individual densities' variances - see Section [3]) 

Besides Gaussians, the planted partition model (defined later) has also been studied; both these distributions have very "thin tails" and a lot of independence, so one can appeal to concentration results. In Section 6.3, we give a clustering algorithm for a mixture of general densities for which we only assume bounds on the variance (and no further concentration). Based on our algorithm, we show how to classify all but an $\epsilon$ fraction of points in this model. Section 3 has references to recent work dealing with distributions which may not even have variance, but these results are only for the special class of product densities, with additional constraints.

One crucial technical result we prove (Theorem 5.5) may be of independent interest. It shows that the good old $k$-means algorithm [Llo82] converges to the "true centers" even in the presence of spurious points provided the initial (estimated) centers are close enough to the corresponding actual centers and all but an $\epsilon$ fraction of the points satisfy the proximity condition. Convergence (or lack of it) of the $k$-means algorithm is again well-studied ([ORSS06, AV06, Das03, HPS05]). The result of [ORSS06] (one of the few to formulate sufficient conditions for the $k$-means algorithm to provably work) assumes the condition that the optimal clustering with $k$ centers is substantially better than that with fewer centers and shows that one iteration of $k$-means yields a near-optimal solution. We show in Section 6.4 that their condition implies proximity for all but an $\epsilon$ fraction of the points. This allows us to prove that our algorithm, which is again based on the $k$-means algorithm, gives a PTAS.

The proof of Theorem 5.5 is based on Theorem 5.4, which shows that if current centers are close to the true centers, then misclassified points (whose nearest current center is not the one closest to the true center) are far away from true centers and so there cannot be too many of them. This is based on a clean geometric argument shown pictorially in Figure 2. Our main theorem in addition allows for an $\epsilon$ fraction of "spurious" points which do not satisfy the proximity condition. Such errors have often proved difficult to account for.

As indicated, all results on generative models assume a lower bound on the inter-center separation in terms of the spectral norm. In Section 7, we describe a construction (when data is from a generative model - a mixture of distributions) which boosts the ratio of inter-center separation to spectral norm. The construction is the following: we pick two sets of samples $A_1, A_2, \ldots, A_n$ and $B_1, B_2, \ldots, B_n$ independently from the mixture. We define new points $X_1, X_2, \ldots, X_n$, where $X_i$ is defined as $(A'_i \cdot B'_1, A'_i \cdot B'_2, \ldots, A'_i \cdot B'_n)$, where $'$ denotes that we have subtracted the mean (of the mixture). Using this, we are able to reduce the dependence of inter-center separation on the minimum weight of a component in the mixture that all models generally need. This technique of boosting is likely to have other applications.
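
The construction of the points $X_i$ amounts to one matrix product of the centered sample matrices. The following is a minimal sketch, assuming NumPy; the function name `boost` is hypothetical, and the empirical mean is used as a stand-in for the mean of the mixture, which is an assumption of this example rather than something the text prescribes.

```python
import numpy as np

def boost(A, B, mixture_mean):
    """Row i of the result is (A'_i . B'_1, ..., A'_i . B'_n),
    where ' denotes subtraction of the mixture mean."""
    A_centered = A - mixture_mean
    B_centered = B - mixture_mean
    return A_centered @ B_centered.T

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))    # first set of samples (illustrative data)
B = rng.normal(size=(6, 4))    # second, independent set of samples
mean_estimate = np.vstack([A, B]).mean(axis=0)   # assumption: empirical stand-in for the mean
X = boost(A, B, mean_estimate)
```

Each new point lives in $n$ dimensions, one coordinate per sample in the second set.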



2 Preliminaries and the Main Theorem

For a matrix $A$, we shall use $\|A\|$ to denote its spectral norm. For a vector $v$, we use $|v|$ to denote its length. We are given $n$ points in $\mathbb{R}^d$ which are divided into $k$ clusters $T_1, T_2, \ldots, T_k$. Let $\mu_r$ denote the mean of cluster $T_r$ and $n_r$ denote $|T_r|$. Let $A$ be the $n \times d$ matrix with rows corresponding to the points. Let $C$ be the $n \times d$ matrix where $C_i = \mu_r$ for all $i \in T_r$. We shall use $A_i$ to denote the $i$-th row of $A$. Let



$$\Delta_{rs} = \left( \frac{ck}{\sqrt{n_r}} + \frac{ck}{\sqrt{n_s}} \right) \|A - C\|,$$

where $c$ is a large enough constant.

Definition 2.1 We say a point $A_i \in T_r$ satisfies the proximity condition if for any $s \ne r$, the projection of $A_i$ onto the $\mu_r$ to $\mu_s$ line is at least $\Delta_{rs}$ closer to $\mu_r$ than to $\mu_s$. We let $G$ (for good) be the set of points satisfying the proximity condition.

Note that the proximity condition implies that the distance between $\mu_r$ and $\mu_s$ must be at least $\Delta_{rs}$. We are now ready to state the theorem.

Theorem 2.2 If $|G| \ge (1 - \epsilon) \cdot n$, then we can correctly classify all but $O(k^2 \epsilon \cdot n)$ points in polynomial time. In particular, if $\epsilon = 0$, all points are classified correctly.
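
The proximity condition of Definition 2.1 can be tested point by point. The following is a minimal sketch, assuming NumPy; the function name `proximity_mask` is hypothetical, and the margins $\Delta_{rs}$ are passed in as a matrix `Delta` instead of being computed, so the toy values below are assumptions chosen only for illustration.

```python
import numpy as np

def proximity_mask(A, labels, Delta):
    """Boolean mask: True where point A[i] satisfies the proximity condition.

    Delta[r, s] is the required margin for the ordered pair (r, s).
    """
    k = Delta.shape[0]
    means = np.vstack([A[labels == r].mean(axis=0) for r in range(k)])
    good = np.ones(len(A), dtype=bool)
    for i, r in enumerate(labels):
        for s in range(k):
            if s == r:
                continue
            gap = means[s] - means[r]
            dist = np.linalg.norm(gap)
            # Signed position of the projection of A[i] on the line, measured from mu_r.
            t = (A[i] - means[r]) @ gap / dist
            # |t| is the distance of the projection to mu_r, |t - dist| the distance to mu_s.
            if abs(t - dist) - abs(t) < Delta[r, s]:
                good[i] = False
    return good

A = np.array([[-1, 0], [1, 0], [0, 3], [0, -3],
              [9, 0], [11, 0], [10, 2], [10, -2]], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
loose = proximity_mask(A, labels, np.full((2, 2), 5.0))
strict = proximity_mask(A, labels, np.full((2, 2), 9.0))
```

With the looser margin every point is good; with the stricter one, the two points lying between the cluster means fall out of $G$.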

Often, when applying this theorem to learning a mixture of distributions, $A$ will correspond to a set of $n$ independent samples from the mixture. We will denote the corresponding distributions by $F_1, \ldots, F_k$, and their relative weights by $w_1, \ldots, w_k$. Often, $\sigma_r$ will denote the maximum variance along any direction of the distribution $F_r$, and $\sigma$ will denote $\max_r \sigma_r$. We denote the minimum mixing weight of a distribution as $w_{\min}$.


3 Previous Work 

Learning mixtures of distributions is one of the central problems in machine learning. There is a vast amount of literature on learning mixtures of Gaussian distributions. One of the most popular methods for this is the well known EM algorithm which maximizes the log likelihood function [DLR77]. However, there are few results which demonstrate that it converges to the optimal solution. Dasgupta [Das99] introduced the problem of learning distributions under suitable separation conditions, i.e., we assume that the distance between the means of the distributions in the mixture is large, and the goal is to recover the original clustering of points (perhaps with some error).

We first summarize known results for learning mixtures of Gaussian distributions under separation conditions. We ignore logarithmic factors in the separation condition. We also ignore the minimum number of samples required by the various algorithms - they are often bounded by a polynomial in the dimension and the mixing weights. Let $\sigma_r$ be the maximum variance of the Gaussian $F_r$ in any direction. Dasgupta [Das99] gave an algorithm based on random projection to learn a mixture of Gaussians provided mixing weights of all distributions are about the same, and $|\mu_i - \mu_j|$ is $\Omega((\sigma_i + \sigma_j) \cdot \sqrt{n})$. Dasgupta and Schulman [DS07] gave an EM based algorithm provided $|\mu_i - \mu_j|$ is $\Omega((\sigma_i + \sigma_j) \cdot n^{1/4})$. Arora and Kannan [AK01] also gave a learning algorithm with similar separation conditions. Vempala and Wang [VW04] were the first to demonstrate the effectiveness of spectral techniques. For spherical Gaussians, their algorithm worked with a much weaker separation condition of $\Omega((\sigma_i + \sigma_j) \cdot k^{1/4})$ between $\mu_i$ and $\mu_j$. Achlioptas and McSherry [AM05] extended this to arbitrary Gaussians with separation between $\mu_i$ and $\mu_j$ being at



[BV08] gave a learning algorithm where the separation only depends on the variance perpendicular to a hyperplane separating two Gaussians (the so called "parallel pancakes problem").

Much less is known about learning mixtures of heavy tailed distributions. Most of the known results assume that each distribution is a product distribution, i.e., projections along coordinate axes are independent. Often, they also assume some slope condition on the line joining any two means. These slope conditions typically say that the unit vector along such lines does not lie almost entirely along very few coordinates. Such a condition is necessary because if the only difference between two distributions were a single coordinate, then one would require much stronger separation conditions. Dasgupta et al. [DHKS05] considered the problem of learning product distributions of heavy tailed distributions when each component distribution
satisfied the following mild condition : P[\X — p\ > aR] < Here R is the half -radius of the distribution 
(these distributions can have unbounded variance). Their algorithm could classify at least a $(1 - \epsilon)$ fraction



of the points provided the distance between any two means is at least Q, I I . Here R is the maximum 



half-radius of the distributions along any coordinate. Under even milder assumptions on the distributions and a slope condition, they could correctly classify all but an $\epsilon$ fraction of the points provided the corresponding separation condition holds. Their algorithm, however, requires an exponential (in $d$ and $k$) amount of time. This problem was resolved by Chaudhuri and Rao [CR08]. Dasgupta et al. [DHKM07] considered the problem of classifying samples from a mixture of arbitrary distributions with bounded variance in any direction. They showed that if the separation between the means is $\Omega(\sigma k)$ and a suitable slope condition holds, then all the samples can be correctly classified. Their paper also gives a general method for bounding the spectral norm of a matrix when the rows are independent (and some additional conditions hold). We will mention this condition formally in Section 6 and make heavy use of it.

Finally, we discuss the planted partition model [McS01]. In this model, an instance consists of a set of $n$ points, and there is an implicit partition of these $n$ points into $k$ groups. Further, there is an (unknown) $k \times k$ matrix of probabilities $P$. We are given a graph $G$ on these $n$ points, where an edge between two vertices from groups $i$ and $j$ is present with probability $P_{ij}$. The goal is to recover the actual partition of the points (and hence, an approximation to the matrix $P$ as well). We can think of this as a special case of learning a mixture of $k$ distributions, where the distribution $F_r$ corresponding to the $r$-th part is as follows: $F_r$ is a distribution over $\{0, 1\}^n$, one coordinate corresponding to each vertex. The coordinate corresponding to vertex $u$ is set to 1 with probability $P_{rj}$, where $j$ denotes the group to which $u$ belongs. Note that the mean of $F_r$, $\mu_r$, is equal to the vector $(P_{r\psi(u)})_{u \in V}$, where $\psi(u)$ denotes the group to which the vertex $u$ belongs.
McSherry [McS01] showed that if the following separation condition is satisfied, then one can recover the actual partition of the vertex set with probability at least $1 - \delta$: for all $r, s$, $r \ne s$,

$$|\mu_r - \mu_s|^2 \ge c \cdot \sigma^2 \cdot k \cdot \left( \frac{1}{w_{\min}} + \log \frac{n}{\delta} \right), \qquad (1)$$

where $c$ is a large constant, $w_{\min}$ is such that every group has size at least $w_{\min} \cdot n$, and $\sigma^2$ denotes $\max_{i,j} P_{ij}$.
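
To make the planted partition model concrete, the following minimal sketch (assuming NumPy) samples a graph from it and builds the matrix of means whose row for vertex $u$ is $\mu_{\psi(u)}$. The function name, the group sizes and the matrix `P` are illustrative assumptions; the sketch also assumes a symmetric `P` and an undirected graph without self-loops.

```python
import numpy as np

def sample_planted_partition(groups, P, rng):
    """Sample a symmetric 0-1 adjacency matrix and return it with the mean matrix C.

    groups[u] is psi(u), the group of vertex u; P is a symmetric k x k
    matrix of edge probabilities.
    """
    n = len(groups)
    C = P[np.ix_(groups, groups)]          # C[u, v] = P[psi(u), psi(v)]
    coin = rng.random((n, n)) < C
    upper = np.triu(coin, k=1)             # keep one coin flip per unordered pair
    adj = (upper | upper.T).astype(float)  # symmetrize, no self-loops
    return adj, C

groups = np.repeat(np.arange(3), 40)       # three groups of 40 vertices (illustrative)
P = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.7, 0.2],
              [0.1, 0.2, 0.9]])
rng = np.random.default_rng(0)
adj, C = sample_planted_partition(groups, P, rng)
```

Rows of `adj` play the role of the data points $A_i$, and rows of `C` depend only on the group of the vertex.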

There is a rich body of work on the $k$-means problem and heuristic algorithms for this problem (see for example [KSS10, ORSS06] and references therein). One of the most widely used algorithms for this problem was given by Lloyd [Llo82]. In this algorithm, we start with an arbitrary set of $k$ candidate centers. Each point is assigned to the closest candidate center - this clusters the points into $k$ clusters. For each cluster, we update the candidate center to the mean of the points in the cluster. This gives a new set of $k$ candidate centers. This process is repeated till we get a local optimum. This algorithm may take superpolynomial time to converge [AV06]. However, there is a growing body of work on proving that this algorithm gives a good clustering in polynomial time if the initial choice of centers is good [AV07, ADK09, ORSS06]. Ostrovsky et al. [ORSS06] showed that a modification of Lloyd's algorithm gives a PTAS for the $k$-means problem if there is a sufficiently large separation between the means. Our result also fits in this general theme - the $k$-means algorithm on a choice of centers obtained from a simple spectral algorithm classifies the points correctly.

4 Our Contributions 

Our main contribution is to show that a set of points satisfying a deterministic proximity condition (based on spectral norm) can be correctly classified (Theorem 2.2). The algorithm is described in Figure 1. It has two main steps - first find an initial set of centers based on SVD, and then run the standard $k$-means algorithm with these initial centers as seeds. In Section 5, we show that after each iteration of the $k$-means algorithm, the set of centers comes exponentially close to the true centers. Although both steps of our algorithm - SVD and the $k$-means algorithm - have been well studied, ours is the first result which shows that combining the two leads to a provably good algorithm. In Section 6, we give several applications of Theorem 2.2. We have the following results for learning mixtures of distributions (we ignore poly-logarithmic factors in the discussion below):

• Arbitrary Gaussian Distributions with separation $\Omega\left(\frac{\sigma k}{\sqrt{w_{\min}}}\right)$: as mentioned above, this matches known results [AM05, KSV08] except for the fact that the separation condition between two distributions depends on the maximum standard deviation (as compared to standard deviations of these distributions only).

• Planted distribution model with separation $\Omega\left(\frac{k \sigma}{\sqrt{w_{\min}}}\right)$: this matches the result of McSherry [McS01] except for a $\sqrt{k}$ factor which we can also remove with a more careful analysis.

• Distributions with bounded variance along any direction: we can classify all but an $\epsilon$ fraction of points if the separation between means is at least $\Omega\left(\frac{k \sigma}{\sqrt{\epsilon}}\right)$. Although results are known for classifying (all but a small fraction of) points from mixtures of distributions with unbounded variance [DHKS05, CR08], such results work for product distributions only.

• PTAS using the $k$-means algorithm: We show that the separation condition of Ostrovsky et al. [ORSS06] is stronger than the proximity condition. Using this fact, we are also able to give a PTAS based on the $k$-means algorithm.

Further, ours is the first algorithm which applies to all of the above settings. In Section 7, we give a general technique for working with weaker separation conditions (for learning mixtures of distributions). Under certain technical conditions described in Section 7, we give a construction which increases the inter-mean distance at a much faster rate than the spectral norm of $A - C$ as we increase the number of samples. As applications of this technique, we have the following results:



• Arbitrary Gaussians with separation $\Omega\left(\sigma k \cdot \log \frac{d}{w_{\min}}\right)$: this is the first result for arbitrary Gaussians where the separation depends only logarithmically on the minimum mixture weight.

• Power-law distributions with sufficiently large (but constant) exponent 7 (defined in equation (fT3T >) : 
We prove that we can learn all but e fraction of samples provided the separation between means is 

Q,[ak ■ I log — - — h T- • For large values of 7, it significantly reduces the dependence on e. 



We expect this technique to have more applications. 



5 Proof of Theorem 2.2 



Our algorithm for correctly classifying the points will run in several iterations. At the beginning of each iteration, it will have a set of $k$ candidate points. By a Lloyd-like step, it will replace these points by another set of $k$ points. This process will go on for a polynomial number of steps.



1. (Base case) Let $\hat{A}_i$ denote the projection of the points on the best $k$-dimensional subspace found by computing SVD of $A$. Let $\nu_r$, $r = 1, \ldots, k$, denote the centers of a (near)-optimal solution to the $k$-means problem for the points $\hat{A}_i$.

2. For $l = 1, 2, \ldots$ do

(i) Assign each point $A_i$ to the closest point among $\nu_r$, $r = 1, \ldots, k$. Let $S_r$ denote the set of points assigned to $\nu_r$.

(ii) Define $\eta_r$ as the mean of the points $S_r$. Update $\eta_r$, $r = 1, \ldots, k$, as the new centers, i.e., set $\nu_r = \eta_r$ for the next iteration.



Figure 1: Algorithm Cluster 

The iterative procedure is described in Figure 1. In the first step, we can use any constant factor approximation algorithm for the $k$-means problem. Note that the algorithm is the same as Lloyd's algorithm, but we start with a special set of initial points as described in the algorithm. We now prove that after the first step (the base case), the estimated centers are close to the actual ones - this case follows from [KV09], but we prove it below for the sake of completeness.
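
The two steps of Algorithm Cluster can be sketched as follows, assuming NumPy. All function names are hypothetical. In particular, the base case here uses a farthest-first seeding followed by Lloyd steps on the projected points as a simple stand-in; it is not a constant factor approximation algorithm for $k$-means, and the number of iterations is an arbitrary choice for this example.

```python
import numpy as np

def rank_k_projection(A, k):
    """Project the rows of A on the best k-dimensional subspace (via SVD)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def assign(points, centers):
    """Index of the closest center for every point."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def update(points, labels, centers):
    """Replace each center by the mean of its points (empty clusters keep their center)."""
    new_centers = centers.copy()
    for r in range(len(centers)):
        members = points[labels == r]
        if len(members) > 0:
            new_centers[r] = members.mean(axis=0)
    return new_centers

def farthest_first(points, k):
    """Stand-in seeding: repeatedly pick the point farthest from the chosen ones."""
    chosen = [0]
    d2 = ((points - points[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        nxt = int(d2.argmax())
        chosen.append(nxt)
        d2 = np.minimum(d2, ((points - points[nxt]) ** 2).sum(axis=1))
    return points[chosen].copy()

def cluster(A, k, iterations=20):
    A_hat = rank_k_projection(A, k)
    centers = farthest_first(A_hat, k)
    for _ in range(iterations):            # step 1: centers for the projected points
        centers = update(A_hat, assign(A_hat, centers), centers)
    for _ in range(iterations):            # step 2: Lloyd iterations on the original points
        centers = update(A, assign(A, centers), centers)
    return assign(A, centers), centers

rng = np.random.default_rng(1)
true = np.repeat(np.arange(3), 100)
A = 50.0 * np.eye(3, 10)[true] + rng.normal(size=(300, 10))   # well-separated toy data
pred, centers = cluster(A, 3)
```

On this well-separated toy data the predicted clusters coincide with the planted ones up to a relabeling.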

Lemma 5.1 (Base Case) After the first step of the algorithm above,

$$|\mu_r - \nu_r| \le 20 \sqrt{k} \cdot \frac{\|A - C\|}{\sqrt{n_r}}.$$

Proof. Suppose, for sake of contradiction, that there exists an $r$ such that all the centers $\nu_1, \ldots, \nu_k$ are at least $20 \sqrt{k} \cdot \frac{\|A - C\|}{\sqrt{n_r}}$ distance away from $\mu_r$. Consider the points in $T_r$. Suppose $\hat{A}_i \in T_r$ is assigned to the center $\nu_{c(i)}$ in this solution. The assignment cost for these points in this optimal $k$-means solution is

$$\sum_{i \in T_r} |\hat{A}_i - \nu_{c(i)}|^2 = \sum_{i \in T_r} |(\mu_r - \nu_{c(i)}) - (\mu_r - \hat{A}_i)|^2$$

$$\ge \frac{1}{2} \sum_{i \in T_r} |\mu_r - \nu_{c(i)}|^2 - \sum_{i \in T_r} |\mu_r - \hat{A}_i|^2 \qquad (2)$$

$$\ge 20k \cdot \|A - C\|^2 - 5k \cdot \|A - C\|^2 = 15k \|A - C\|^2 \qquad (3)$$

where inequality (2) follows from the fact that for any two numbers $a, b$, $(a - b)^2 \ge \frac{a^2}{2} - b^2$; and inequality (3) follows from the fact that $\|\hat{A} - C\|_F^2 \le 5k \cdot \|A - C\|^2$. But this is a contradiction, because one feasible solution to the $k$-means problem is to assign points $\hat{A}_i$, $i \in T_s$, to $\mu_s$ for $s = 1, \ldots, k$ - the cost of this solution is $\|\hat{A} - C\|_F^2 \le 5k \|A - C\|^2$. ∎

Observe that the lemma above implies that there is a unique center $\nu_r$ associated with each $\mu_r$. We now prove a useful lemma which states that removing a small number of points from a cluster $T_r$ can move the mean of the remaining points by only a small distance.

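Lemma 5.2 Let $X$ be a subset of $T_r$. Let $m(X)$ denote the mean of the points in $X$. Then

$$|m(X) - \mu_r| \le \frac{\|A - C\|}{\sqrt{|X|}}.$$

Proof. Let $u$ be the unit vector along $m(X) - \mu_r$. Now,

$$|(A - C) \cdot u| \ge \left( \sum_{i \in X} ((A_i - \mu_r) \cdot u)^2 \right)^{1/2} \ge \frac{1}{\sqrt{|X|}} \left| \sum_{i \in X} (A_i - \mu_r) \cdot u \right| = \sqrt{|X|} \cdot |m(X) - \mu_r|.$$

But $|(A - C) \cdot u| \le \|A - C\|$. This proves the lemma. ∎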
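
Lemma 5.2 is easy to sanity-check numerically. The following is a minimal sketch, assuming NumPy; the synthetic data and the choice of subsets are illustrative assumptions. It evaluates the slack $\|A - C\| / \sqrt{|X|} - |m(X) - \mu_r|$ for several subsets $X$ of one cluster, including a deliberately one-sided subset.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(2), 60)
A = np.array([[0.0, 0.0, 0.0], [8.0, 0.0, 0.0]])[labels] + rng.normal(size=(120, 3))
means = np.vstack([A[labels == r].mean(axis=0) for r in range(2)])
C = means[labels]
norm = np.linalg.norm(A - C, ord=2)        # spectral norm of A - C

def lemma_slack(subset):
    """Right-hand side minus left-hand side of Lemma 5.2 for X = A[subset], X inside T_0."""
    shift = np.linalg.norm(A[subset].mean(axis=0) - means[0])
    return norm / np.sqrt(len(subset)) - shift

cluster0 = np.flatnonzero(labels == 0)
random_subsets = [rng.choice(cluster0, size=m, replace=False) for m in (1, 5, 20, 60)]
# A one-sided subset: the 10 points of T_0 with the largest first coordinate.
extreme = cluster0[np.argsort(A[cluster0, 0])[-10:]]
slacks = [lemma_slack(s) for s in random_subsets + [extreme]]
```

Every slack is nonnegative, as the lemma guarantees for any subset of a cluster.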

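Corollary 5.3 Let $Y \subseteq T_s$ such that $|T_s - Y| \le \delta \cdot n_s$, where $\delta < \frac{1}{2}$. Let $m(Y)$ denote the mean of the points in $Y$. Then

$$|m(Y) - \mu_s| \le \frac{2 \sqrt{\delta} \cdot \|A - C\|}{\sqrt{n_s}}.$$

Proof. Let $X$ denote $T_s - Y$. We know that $\mu_s \cdot |T_s| = |X| \cdot m(X) + |Y| \cdot m(Y)$. So we get

$$|m(Y) - \mu_s| = \frac{|X|}{|Y|} \cdot |m(X) - \mu_s| \le \frac{\sqrt{|X|}}{|Y|} \cdot \|A - C\|,$$

where the inequality above follows from Lemma 5.2. The result now follows because $|Y| \ge \frac{n_s}{2}$. ∎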

Now we show that if the estimated centers are close to the actual centers, then one iteration of the second step in the algorithm will reduce this separation by at least half.
Notation:

• $\nu_1, \nu_2, \ldots, \nu_k$ denote the current centers at the beginning of an iteration in the second step of the algorithm, where $\nu_r$ is the current center closest to $\mu_r$.

• $S_r$ denotes the set of points $A_i$ for which the closest current center is $\nu_r$.

• $\eta_r$ denotes the mean of points in $S_r$; so $\eta_r$ are the new centers. Let $\delta_r = |\mu_r - \nu_r|$.

The theorem below shows that the set of misclassified points (which really belong to $T_r$, but have $\nu_s$, $s \ne r$, as the closest current center) are not too many in number. The proof first shows that any misclassified point must be far away from $\mu_r$ and since the sum of squared distances from $\mu_r$ for all points in $T_r$ is bounded, there cannot be too many.

Theorem 5.4 Assume that $\delta_r + \delta_s \le \Delta_{rs}/16$ for all $r \ne s$. Then,

|r r n S ,n |< "-''1- c " 2 ^ + <> (4) 

l\ rs \l-L r l-l s \ 

Further, for any $W \subseteq T_r \cap S_s$,

$$|m(W) - \mu_s| \le \frac{100 \|A - C\|}{\sqrt{|W|}}. \qquad (5)$$

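Proof. Let $\hat{v}$ denote the projection of a vector $v$ to the affine space $V$ spanned by $\mu_1, \ldots, \mu_k$, $\nu_1, \ldots, \nu_k$ and $\eta_1, \eta_2, \ldots, \eta_k$. Assume $\hat{A}_i \in T_r \cap S_s \cap G$. Splitting $\hat{A}_i$ into its projection along the line $\mu_r$ to $\mu_s$ and the component orthogonal to it, we can write

$$\hat{A}_i = \frac{1}{2}(\mu_r + \mu_s) + \lambda(\mu_r - \mu_s) + u,$$

where $u$ is orthogonal to $\mu_r - \mu_s$. Since $\hat{A}_i$ is closer to $\nu_s$ than to $\nu_r$, we have

$$\hat{A}_i \cdot (\nu_s - \nu_r) \ge \frac{1}{2}(\nu_s - \nu_r) \cdot (\nu_s + \nu_r),$$

i.e.,

$$\frac{1}{2}(\mu_r + \mu_s) \cdot (\nu_s - \nu_r) + \lambda(\mu_r - \mu_s) \cdot (\nu_s - \nu_r) + u \cdot (\nu_s - \nu_r) \ge \frac{1}{2}(\nu_s - \nu_r) \cdot (\nu_s + \nu_r).$$

We have $u \cdot (\nu_s - \nu_r) = u \cdot ((\nu_s - \mu_s) - (\nu_r - \mu_r))$ since $u$ is orthogonal to $\mu_r - \mu_s$. The last quantity is at most $|u| \delta$, where $\delta = \delta_r + \delta_s$. Substituting this we get

$$\frac{1}{2}(\mu_r + \mu_s - \nu_r - \nu_s) \cdot (\nu_s - \nu_r) + \lambda(\mu_r - \mu_s) \cdot (\nu_s - \nu_r) + |u| \cdot \delta \ge 0,$$

i.e.,

$$\frac{\delta^2}{2} + \frac{\delta}{2} |\mu_r - \mu_s| - \lambda |\mu_r - \mu_s|^2 + \lambda \delta |\mu_r - \mu_s| + |u| \delta \ge 0. \qquad (6)$$

Now,

$$|\hat{A}_i - \mu_r| \ge |u| \ge \frac{\lambda}{\delta} \cdot |\mu_r - \mu_s|^2 - \left( \frac{1}{2} + \lambda \right) |\mu_r - \mu_s| - \frac{\delta}{2} \quad \text{(using (6))}$$

$$\ge \frac{\Delta_{rs} |\mu_r - \mu_s|}{64 \delta},$$

where the last inequality follows from the fact that $\lambda \ge \frac{\Delta_{rs}}{2 |\mu_r - \mu_s|}$ (proximity condition) and the assumption that $\delta \le \Delta_{rs}/16$. Therefore, we have

$$|T_r \cap S_s \cap G| \cdot \frac{\Delta_{rs}^2 |\mu_r - \mu_s|^2}{(64 \delta)^2} \le \sum_{i \in T_r \cap S_s \cap G} |\hat{A}_i - \mu_r|^2 \le \sum_{i \in T_r} |\hat{A}_i - \mu_r|^2.$$

Figure 2: Misclassified

If we take an orthonormal basis $u_1, u_2, \ldots, u_p$ of $V$, we see that $\sum_{i \in T_r} |\hat{A}_i - C_i|^2 = \sum_{t=1}^{p} \sum_{i \in T_r} |(A_i - C_i) \cdot u_t|^2 \le \sum_{t=1}^{p} \|A - C\|^2 \le 3k \|A - C\|^2$, which proves the first statement of the theorem.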
For the second statement, we can write $m(W)$ as

$$m(W) = \frac{1}{2}(\mu_r + \mu_s) + \lambda(\mu_r - \mu_s) + u,$$

where $u$ is orthogonal to $\mu_r - \mu_s$. Since $m(W)$ is the average of points in $S_s$, we get (arguing as for (6)):

\U\ > \U r — LL S \ . 



Now, we have 

|m(W) - fi r \ 2 = \u\ 2 + ( A - - ) \n r — fi s \ 2 , and \m(W) - fi s \ 2 = \u\ 2 + ( A + - ) \fi r — fi s 



1X2 



2 



If A < 1/4, then clearly, \m(W) - fi s \ < 4\m(W) - fi r \. If A > 1/4, then we have \u\ > -^\n r - ^s\ 2 > 
\ ■ (\ + ±) • |/i r - /x a | because Itr^tA > 16. This again yields \m(W) - fi s \ < A\u\ < A\m(W) 
Now, by Lemma 15^21 we have \m(W) — fi r \ < -^=11, so the second statement in the theorem. ■ 



fl r 



We are now ready to prove the main theorem of this section which will directly imply Theorem 2.2. This shows that $k$-means converges if the starting centers are close enough to the corresponding true centers. To gain intuition, it is best to look at the case $\epsilon = 0$, when all points satisfy the proximity condition. Then the theorem says that if $|\nu_s - \mu_s| \le \frac{\gamma \|A - C\|}{\sqrt{n_s}}$ for all $s$, then $|\eta_s - \mu_s| \le \frac{\gamma \|A - C\|}{2 \sqrt{n_s}}$, thus halving the upper bound of the distance to $\mu_s$ in each iteration.

Theorem 5.5 If

$$\delta_s \le \max \left( \frac{\gamma \|A - C\|}{\sqrt{n_s}}, \; \frac{800 \sqrt{k \epsilon n} \cdot \|A - C\|}{n_s} \right)$$

for all $s$ and a parameter $\gamma \le ck/50$, then

$$|\eta_s - \mu_s| \le \max \left( \frac{\gamma \|A - C\|}{2 \sqrt{n_s}}, \; \frac{800 \sqrt{k \epsilon n} \cdot \|A - C\|}{n_s} \right)$$

for all $s$.

Proof. Let $n_{rs}$, $\mu_{rs}$ denote the number and mean respectively of $T_r \cap S_s \cap G$ and $n'_{rs}$, $\mu'_{rs}$ of $(T_r \setminus G) \cap S_s$. Similarly, define $n_{ss}$ and $\mu_{ss}$ as the size and mean of the points in $T_s \cap S_s$. We get



We have 

|Mss Ms I — 



V\S s \-n SSnA .W0\\A-C\\ , 100||A-C|| 
A - C\\ ; \n rs - (is\ < — ; \u rs - ii s \ < - 



/n r 



'n' 



where the first one is from Corollary 5.3 (it is easy to check from the first statement of Theorem 5.4 that $n_{ss} \ge n_s/2$) and the last two are from the second statement in Theorem 5.4.
Now using the fact that length is a convex function, we have



i ^ llss I I 1 nrs I I 1 ' L rs 1 / 

\Vs-fls\ < i^tIMss - Ms| + 2^ Tol^ rs ~ + 2^ rollers - Ms 
Ps| r^s^^ r P*l 



< 200||A-C|| 



+ E — + E 



'n' 



n s 



< 4oop-c||[£^ + £ 



'n' 



n s 



since \S S \ — n ss = J2r^s n rs + J2s n 'rs- Let us look at each of the terms above. Note that n rs < 
24 C fcp-C||W(^ s )3 (usi xheo^g^,, So 



E 



1l s 



< 



< 



< 



< 



Wck\\A - C\\ max(5 n 5 s ) 



r -± s ^rs " I Mr Ms 



f>yfck\\A - C\\< 



E
800\/fc



A rs • I Mr — Ms I V \/min(n r , n s ) min(n r ,n s ) 



5Vck s-^ min(n r ,n s ) 
n s , c 2 k 2 

7 \Jken 
+ 



7 



+ 



800^ 



en 



y/mm(n r ,n s ) min(n r , n s ) 



cJn s n s 



Also, note that Y^ r n 'rs — en - So we get ^ r w " rs < ■ Assuming c to be large enough constant 
proves the theorem. ∎

Now we can easily finish the proof of Theorem 2.2. Observe that after the base case in the algorithm, the statement of Theorem 5.5 holds with $\gamma = 20 \sqrt{k}$. So after enough iterations of the second step in our algorithm, $\gamma$ will become very small and so we will get

$$\delta_s \le 800 \sqrt{k \epsilon n} \cdot \frac{\|A - C\|}{n_s}$$

for all $s$. Now substituting this in Theorem 5.4, we get



lTr n s, n G\ < «^±M=^^L < m 

Af s \p, r - p, s \ A • mm(n r ,n 8 y 



Summing over all pairs $r, s$ implies Theorem 2.2.



6 Applications 

We now give applications of Theorem 2.2 to various settings. One of the main technical steps here would be to bound the spectral norm of a random $n \times d$ matrix $Y$ whose rows are chosen independently. We use the following result from [DHKM07]. Let $D$ denote the matrix $E[Y^T Y]$. Also assume that $n \ge d$.

Fact 6.1 Let $\gamma$ be such that $\max_i |Y_i| \le \gamma \sqrt{n}$ and $\|D\| \le \gamma^2 n$. Then $\|Y\| \le \gamma \cdot \sqrt{n} \cdot \text{polylog}(n)$ with high probability.



6.1 Learning in the planted distribution model 

In this section, we show that McSherry's result [McS01] can be derived as a corollary of our main theorem. Consider an instance of the planted distribution model which satisfies the condition (1). We would like to show that with high probability, the points satisfy the proximity condition. Fix a point $A_i \in T_r$. We will show that the probability that it does not satisfy this condition is at most $\frac{\delta}{n}$. Using union bound, it will then follow that the proximity condition is satisfied with probability at least $1 - \delta$.

Let $s \ne r$. Let $v$ denote the unit vector along $\mu_r - \mu_s$. Let $L_{rs}$ denote the line joining $\mu_r$ and $\mu_s$, and $\hat{A}_i$ be the projection of $A_i$ on $L_{rs}$. The following result shows that the distance between $\hat{A}_i$ and $\mu_r$ is small with high probability.

Lemma 6.2 Assume a > 3l ° 1 f n , where a = maxjj \JP%j. With probability at least 1 — -4r, 

\Ai - Hr\ <ck- a ■ (log fy) + ) , 

where c is a large constant. 

Proof. For a vector $A_i$, we use $A_{ij}$ to denote the coordinate of $A_i$ at position $j$. Define $\mu_{rj}$ similarly. First observe that $|\hat{A}_i - \mu_r| = |v \cdot (A_i - \mu_r)|$. The coordinates of $v$ corresponding to points belonging to a particular cluster are the same - let $v^t$ denote this value for cluster $T_t$. So we get

$$|v \cdot (A_i - \mu_r)| \le \sum_{t=1}^{k} |v^t| \cdot \left| \sum_{j \in T_t} (A_{ij} - \mu_{rj}) \right| \le \sum_{t=1}^{k} \frac{1}{\sqrt{n_t}} \cdot \left| \sum_{j \in T_t} A_{ij} - P_{rt} \cdot n_t \right|,$$

where $n_t$ denotes the size of cluster $T_t$. The last inequality above follows from the fact that $|v| = 1$. So, $1 \ge \sum_{j \in T_t} v_j^2 = (v^t)^2 \cdot n_t$. Now, if $A_i$ does not satisfy the condition of the lemma, then there must be some $t$ for which



> ca^/rTf (log (^] + 



Now note that $A_{ij}$, $j \in T_t$, are i.i.d. 0-1 random variables with mean $P_{rt}$. Now we use the following version of the Chernoff bound: let $X_1, \ldots, X_l$ be i.i.d. 0-1 random variables, each with mean $p$. Then



Pr 



=i 



> rj ■ Ip 



< 



e -v lp/4 if v < 2e - 1 



For us, n 



otherwise 

p • (log (j) + ^— j ■ If f] < 2e — 1, the probability of this event is at most 

6 



c 2 a 2 



exp 



n 



< 



n 



2 ' 



Now, assume rj > 2e — 1. In this case the probability of this event is at most 



-ca . -Hi- 1 



where we have assumed that cr > 31 ° gn (we need this assumption anyway to use Wigner's theorem for 
bounding ||^4 - C\\). ■ 



Assuming 



we see that 



\fJr - fJ, s \ > 4ck ■ a ■ [ log ( ^] + 1. — 



I A: 



Ms 



I A 



Mr | > 1= r 



with probability at least $1 - \frac{\delta}{n}$. Here, we have used the fact that $\|A - C\| \le c' \cdot \sigma \sqrt{n}$ with high probability (Wigner's theorem). Now, using union bound, we get that all the points satisfy the proximity condition with probability at least $1 - \delta$.

Remark: Here we have used $C$ as the matrix whose rows are the actual means $\mu_r$. But while applying Theorem 2.2, $C$ should represent the means of the samples in $A$ belonging to a particular cluster. The error incurred here can be made very small and will not affect the results. So we shall assume that $\mu_r$ is the actual mean of points in $T_r$. Similar comments apply in other applications described next.



6.2 Learning Mixture of Gaussians 

We are given a mixture of $k$ Gaussians $F_1, \ldots, F_k$ in $d$ dimensions. Let the mixture weights of these distributions be $w_1, \ldots, w_k$, and let $\mu_1, \ldots, \mu_k$ denote their means respectively.

Lemma 6.3 Suppose we are given a set of $n = \mathrm{poly}\left(\frac{d}{w_{\min}}\right)$ samples from the mixture distribution. Then these points satisfy the proximity condition with high probability if

$$|\mu_r - \mu_s| \ge \frac{c k \sigma_{\max}}{\sqrt{w_{\min}}} \, \mathrm{polylog}\left(\frac{d}{w_{\min}}\right)$$

for all $r, s$, $r \ne s$. Here $\sigma_{\max}$ is the maximum variance in any direction of any of the distributions $F_r$.

Proof. It can be shown that $\|A - C\|$ is $O\left(\sigma_{\max} \sqrt{n} \cdot \mathrm{polylog}\left(\frac{d}{w_{\min}}\right)\right)$ with high probability (see [DHKM07]). Further, let $p$ be a point drawn from the distribution $F_r$. Let $L_{rs}$ be the line joining $\mu_r$ and $\mu_s$. Let $\hat{p}$ be the projection of $p$ on this line. Then the fact that $F_r$ is Gaussian implies that $|\hat{p} - \mu_r| \le \sigma_{\max} \, \mathrm{polylog}(n)$ with probability at least $1 - \frac{1}{n^3}$. It is also easy to check that the number of points from $F_r$ in the sample is close to $w_r n$ with high probability. Thus, it follows that all the points satisfy the proximity condition with high probability. ■

The above lemma and Theorem 2.2 imply that we can correctly classify all the points. Since we shall sample at least poly(d) points from each distribution, we can learn each of the distributions to high accuracy.
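The statement of Lemma 6.3 can be explored numerically. The following is a minimal sketch, not the paper's procedure: it assumes the proximity margin has the form $\Delta_{rs} = ck\left(\frac{1}{\sqrt{n_r}} + \frac{1}{\sqrt{n_s}}\right)\|A - C\|$, and both that form and the constant `c = 1` are assumptions of this illustration. The function name `proximity_fraction` is hypothetical.

```python
import numpy as np

def proximity_fraction(points, labels, k, c=1.0):
    """Fraction of points whose projection onto every line mu_r--mu_s is
    closer to their own mean than to the other mean by at least delta_rs.
    The margin formula and the constant c are assumptions of this sketch."""
    means = np.array([points[labels == r].mean(axis=0) for r in range(k)])
    sizes = np.bincount(labels, minlength=k)
    spectral = np.linalg.norm(points - means[labels], ord=2)  # ||A - C||
    ok = np.ones(len(points), dtype=bool)
    for r in range(k):
        idx = np.where(labels == r)[0]
        for s in range(k):
            if s == r:
                continue
            gap = np.linalg.norm(means[s] - means[r])
            direction = (means[s] - means[r]) / gap
            # Coordinate of the projection along the line, measured from mu_r.
            t = (points[idx] - means[r]) @ direction
            delta = c * k * (1 / np.sqrt(sizes[r]) + 1 / np.sqrt(sizes[s])) * spectral
            # (distance to mu_s) minus (distance to mu_r) along the line.
            closer_by = np.abs(gap - t) - np.abs(t)
            ok[idx] &= closer_by >= delta
    return ok.mean()
```

With two unit-variance Gaussians in 10 dimensions, a large separation between the means makes every point pass the check, while a separation comparable to the standard deviation makes every point fail, since the margin then exceeds the distance between the means.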

6.3 Learning Mixture of Distributions with Bounded Variance 

We consider a mixture of distributions $F_1, \ldots, F_k$ with weights $w_1, \ldots, w_k$. Let $\sigma^2$ be an upper bound on the variance along any direction of a point sampled from one of these distributions. In other words,

$$\sigma^2 \ge E\left[ ((x - \mu_r) \cdot v)^2 \right]$$

for all distributions $F_r$ and each unit vector $v$.



Theorem 6.4 Suppose we are given a set of n = poly ( — — ) samples from the mixture distribution. As- 

sume that 
provided 



sume that a > poly ^=( n \ Then there is an algorithm to correctly classify at least 1 — e fraction of the points 



l l ^ 40ka i i ( d 

Mr - AM > ^^Polylog - 

for all r,s,r ^ s. Here s is assumed to be less than t^min- 

Proof. The algorithm is described in Figure 3. We now prove that this algorithm has the desired properties. Let $A$ denote the $n \times d$ matrix of points and $C$ be the corresponding matrix of means. We first bound the spectral norm of $A - C$. The bound obtained is quite high, but is probably tight.



1. Run the first step of Algorithm Cluster on the set of points, and let $\nu_1, \ldots, \nu_k$ denote the centers obtained.

2. Remove centers $\nu_r$ (and the points associated with them) to which fewer than $d^2 \log d$ points are assigned. Let $\nu_1, \ldots, \nu_{k'}$ be the remaining centers.

3. Remove any point whose distance from the nearest center in v\, . . . , is more than ^^?=p 

4. Run the algorithm Cluster on the remaining set of points and output the clustering obtained. 



Figure 3: Algorithm for Clustering points from mixture of distributions with bounded variance. 
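The four steps of Figure 3 can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: `first_step` is a stand-in for step 1 of Algorithm Cluster (here: projection onto the top-$k$ singular subspace, farthest-point seeding, then Lloyd iterations), and the pruning thresholds `min_size` and `radius` are left as caller-supplied parameters. All function names are hypothetical.

```python
import numpy as np

def nearest(points, centers):
    """Index of, and distance to, the nearest center for every point."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), dists.min(axis=1)

def lloyd(points, centers, iters=20):
    """Plain Lloyd iterations from the given starting centers."""
    centers = centers.copy()
    for _ in range(iters):
        labels, _ = nearest(points, centers)
        for r in range(len(centers)):
            if np.any(labels == r):
                centers[r] = points[labels == r].mean(axis=0)
    return centers

def first_step(points, k, rng):
    """Assumed stand-in for step 1 of Cluster: SVD projection, then k-means."""
    _, _, vt = np.linalg.svd(points, full_matrices=False)
    projected = points @ vt[:k].T @ vt[:k]
    seeds = [projected[rng.integers(len(projected))]]
    for _ in range(k - 1):  # farthest-point seeding
        _, dist = nearest(projected, np.array(seeds))
        seeds.append(projected[dist.argmax()])
    return lloyd(projected, np.array(seeds))

def cluster_bounded_variance(points, k, min_size, radius, seed=0):
    rng = np.random.default_rng(seed)
    centers = first_step(points, k, rng)                  # step 1
    labels, _ = nearest(points, centers)
    big = np.bincount(labels, minlength=k) >= min_size    # step 2: drop small centers
    keep = big[labels]                                    # ...and their points
    _, dist = nearest(points, centers[big])
    keep &= dist <= radius                                # step 3: drop far points
    survivors = points[keep]
    final_centers = lloyd(survivors, first_step(survivors, k, rng))  # step 4
    final_labels, _ = nearest(survivors, final_centers)
    return keep, final_labels
```

On three well-separated clusters plus two distant spurious points, the seeding in step 1 tends to spend centers on the spurious points; step 2 then discards those centers, and the final run on the surviving points recovers the three clusters.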



Lemma 6.5 With high probability, $\|A - C\| \le \sigma \sqrt{dn} \cdot \mathrm{polylog}(n)$.

Proof. We use Fact 6.1. Let $Y$ denote $A - C$. Note that $|Y_i|^2 = \sum_{j=1}^{d} (A_{ij} - C_{ij})^2$. Since $E[(A_{ij} - C_{ij})^2] \le \sigma^2$ (because it is the variance of this distribution along one of the coordinate axes), the expected value of $|Y_i|^2$ is at most $\sigma^2 d$. Now using Chebyshev's inequality, we see that the probability that $\max_i |Y_i| \ge \sigma \sqrt{dn} \, \mathrm{polylog}(n)$ is at most $\frac{1}{\mathrm{polylog}(n)}$. Now we consider $Y^T Y = \sum_i Y_i^T Y_i$ (recall that we are treating $Y_i$ as a row vector). So if $v$ is a unit (column) vector, then $v^T E[Y^T Y] v = \sum_i E[|Y_i \cdot v|^2]$. But $E[|Y_i \cdot v|^2]$ is just the variance of the distribution corresponding to $A_i$ along $v$. So this quantity is at most $\sigma^2$ for all $v$. Thus, we see that $\|E[Y^T Y]\| \le \sigma^2 n$. This proves the lemma. ■
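The two quantities that the proof of Lemma 6.5 controls, the largest row norm and the quadratic form $v^T Y^T Y v$, can be checked on synthetic data. The sketch below uses Gaussian rows purely as one convenient instance of a distribution whose variance is at most $\sigma^2$ in every direction; the values of `n`, `d` and `sigma` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 2000, 20, 2.0

# Rows of Y play the role of A_i - C_i: independent, variance sigma^2 in every direction.
Y = sigma * rng.normal(size=(n, d))

spectral = np.linalg.norm(Y, ord=2)           # ||Y||, the largest singular value
max_row = np.linalg.norm(Y, axis=1).max()     # max_i |Y_i|
bound = sigma * np.sqrt(d * n)                # sigma * sqrt(dn), without the polylog factor

v = rng.normal(size=d)
v /= np.linalg.norm(v)                        # a random unit vector
quad_ratio = np.sum((Y @ v) ** 2) / (sigma ** 2 * n)  # v^T Y^T Y v / (sigma^2 n)
```

For data this well behaved, `spectral` sits well below `bound` and `quad_ratio` is close to 1, which is consistent with the remark that the bound is high in general but needed when only the variance is controlled.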

The above lemma allows us to bound the distance between $\mu_r$ and the nearest mean obtained in Step 1 of the algorithm. The proof proceeds along the same lines as that of Lemma 5.1.

Lemma 6.6 For each $\mu_r$, there exists a center $\nu_r$ such that $\nu_r$ is not removed in Step 2 and $|\mu_r - \nu_r| \le$



. polylog(n). 

Proof. Suppose the statement of the lemma is false for $\mu_r$. At most $k d^2 \log d$ points, which is much less than $|T_r|$, are assigned to a center which is removed in Step 2. The remaining points in $T_r$ are assigned to centers which are not removed. So, arguing as in the proof of Lemma 5.1, the clustering cost in Step 1 for points in $T_r$ is at least



\T r \ — kd log d (WaVdk . . . . \ ^ . j , 9 9 „ , , / x ,, - 

— 5 — • ^ • polylog(n) -Y^^r-Ai) 2 > 50a 2 dkn • polylog(n) - ||A - C\\' 



\ v / ieT r 

> Wk\\A-C\\ 2 

where the last inequality follows from Lemma 6.5. But, as in the proof of Lemma 5.1, this is a contradiction. ■

Note that in the lemma above, $\nu_r$ may not be unique for different means $\mu_r$. Call a point $A_i \in T_r$ bad if
\A - Mr I > Call a point A { G T r nice if |A - fj, r \ < ^ 

Lemma 6.7 The number of bad points is at most $d \log d$ with high probability. The number of points which are not nice is at most $d^2 \log d$ with high probability. The number of nice points that are removed is at most $4 k d^2 \log d$.

Proof. Arguing as in the proof of Lemma 6.5, the probability that $|A_i - C_i| \ge \sigma \cdot \sqrt{n}$ is at most $\frac{d}{n}$. So the expected number of bad points is at most $d$. The first statement in the lemma now follows from the Chernoff bound. The second statement is proved similarly. At most $k d^2 \log d$ points are removed in Step 2. Now suppose $A_i$ is nice. Then Lemma 6.6 implies that it will not be removed in Step 3 (using Lemma 6.5 and the fact that $n$ is large enough). ■

Corollary 6.8 With high probability the following event happens : suppose v r does not get removed in Step 
2. Then there is a mean p, r such that | fj, r — v r \ < ■ 

Proof. Since $\nu_r$ is not removed, it has at least one nice point $A_i$ assigned to it (otherwise it will have at most $d^2 \log d$ points assigned to it and it will be removed). The distance of $A_i$ to the nearest mean
is at most ^7=, and Lemma I6T61 implies that there is a center v s which is not removed and for which 



Ms 



|Ms — v s\ < §^/§- So, \v s — Ai\ — Since v r is the closest center to A{, \v r — A{\ < ^5 as well. Now, 
I v r — \i s I < I v r — Ai I + I Ai — fi s I and the result follows. ■ 

Let $A'$ be the set of points which are remaining after the third step of our algorithm. We now define a new clustering $T'_1, \ldots, T'_k$ of points in $A'$. This clustering will be very close to the actual clustering $T_1, \ldots, T_k$, and so it will be enough to correctly cluster a large fraction of the points according to this new clustering. We define

$$T'_r = \{A_i \in T_r : A_i \text{ is not bad and does not get removed}\} \cup \{A_i : A_i \text{ is a bad point which does not get removed and the nearest center among the actual centers is } \mu_r\}.$$

Let $\mu'_r$ be the mean of $T'_r$ and $C'$ be the corresponding matrix of means.

Lemma 6.9 With high probability, for all $r$, $\|A' - C'\| \le O(\sigma \cdot \sqrt{n} \cdot \mathrm{polylog}(n))$, and $|\mu_r - \mu'_r| \le \frac{10 \sigma k d^2 \log d}{\varepsilon \sqrt{n}}$.

Proof. We first prove the second statement. The points in T' r contain all the points in T r except for at 
most 5kd 2 log d points (Lemma I6.7I ). First consider the points in T r — T' T which are not bad. Since all 
these points are at distance less than Oyfn from fj, r , the removal of these points shifts the mean by at most 
bak^/nd^ logd ^ o-fcd jog d ^ j v j ow ma y con t am some bad points as well. First observe that any such bad 

point must be at most 3< ^" away from fi r . Indeed, the reason why we retained this bad point in Steps 2 

and 3 is because it is at distance at most from u r from some r. So combined with Corollary 16.81 this 
statement is true. So these bad points can again shift the mean by a similar amount. This proves the second 
part of the lemma. 

Now we prove the first part of the lemma. Break $A' - C'$ into two parts, $A'_B - C'_B$ and $A'_G - C'_G$: the rows of $A' - C'$ which are from bad points and the remaining (good) rows respectively. $A'_B - C'_B$ has at most
dlogd rows, and each row (as argued above) has length at most 4< ^" . So \\A' B — C' B \\ < 3a\/nlog(d). 
Now consider A' G — C' G . Let C G be the rows of the original matrix C corresponding to to A' G . Then 
I \A' G - C' G \ | < | \A' G - C G \ | + | \C' G - C G \ |. Each row of C G - C' G has length at most 10ak £ d ^ d and so its 

spectral norm is at most 10akd ^ lo s rf ; which is quite small (doesn't involve n at all). So it remains to bound 
$\|A'_G - C_G\|$. Let $Z$ be the rows of $A - C$ which correspond to points which are not bad. Note that the rows of $Z$ are independent and have length at most $\sigma \sqrt{n}$. So applying Fact 6.1 and arguing as in Lemma 6.5, we can show that $\|Z\|$ is at most $\sigma \sqrt{n} \cdot \mathrm{polylog}(n)$. Now observe that $A'_G - C_G$ is obtained by picking some rows of $Z$ (by a random process), and so its spectral norm is at most $\|Z\|$. This proves the lemma. ■

We are now ready to prove the main theorem. We would like to recover the clustering $C'$ (since $C$ and $C'$ agree on all but the bad points). We argue that at least a $(1 - \varepsilon)$ fraction of the points satisfy the proximity
condition. Indeed, it is easy to check that at least (1 — e) fraction of the points Ai are at distance at most 
Aa J? from the corresponding mean \x r and satisfy the proximity condition. Since the distance between fj, s 

and fx' s is very small (dependent inversely on n), and Ai is only Aa J^ far from fi r , it will satisfy the proximity 
condition for A', C' as well (provided n is large enough). ■ 



6.4 Sufficient conditions for convergence of k-means

As mentioned in Section 3, Ostrovsky et al. [ORSS06] provided the first sufficient conditions under which they prove effectiveness of (a suitable variant of) the $k$-means algorithm. Here, we show that their conditions are (much) stronger than the proximity condition. We first describe their conditions. Let $\Delta_k$ be the optimal cost of the $k$-means problem (i.e., the sum of the squared distances of each point to its nearest center) with $k$ centers. They require:

$$\Delta_k \le \varepsilon \Delta_{k-1}.$$

Claim 6.10 The condition above implies the proximity condition for all but an $\varepsilon$ fraction of the points.

Proof. Suppose the above condition is true. One way of getting a solution with $k - 1$ centers is to remove a center $\mu_r$ and move all points in $T_r$ to the nearest other center $\mu_s$. Now, their condition implies

K-/^| 2 > — ||A-C||J. Vs^r. 
n r ■ e 

If some e fraction of T r do not satisfy the proximity condition, then the distance squared of each such point 
to p, r is at least the distance squared along the line p, r to p, s which is at least (1/4) — fi s \ 2 which is at 
least ^(||^4 — C\\p/n r e). So even the assignment cost of such points exceeds \\A — C\\ 2 F , the total cost, a 
contradiction. This proves the claim. ■ 
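To make the condition of [ORSS06] concrete, the ratio between the optimal costs with $k$ and $k - 1$ centers can be computed exactly on tiny one-dimensional inputs. The sketch below is only an illustration: it relies on the standard fact that optimal $k$-means clusters on the line are contiguous in sorted order and enumerates all contiguous splits, which is exponential in $k$ and unrelated to the algorithm analysed in this paper. The function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def optimal_cost_1d(values, k):
    """Optimal k-means cost on the line, by trying every contiguous split."""
    xs = np.sort(np.asarray(values, dtype=float))
    best = np.inf
    for cuts in combinations(range(1, len(xs)), k - 1):
        bounds = (0, *cuts, len(xs))
        cost = sum(((xs[a:b] - xs[a:b].mean()) ** 2).sum()
                   for a, b in zip(bounds[:-1], bounds[1:]))
        best = min(best, cost)
    return best

def cost_ratio(values, k):
    """Ratio of the optimal costs with k and with k - 1 centers."""
    return optimal_cost_1d(values, k) / optimal_cost_1d(values, k - 1)
```

For the six points 0, 1, 10, 11, 20, 21 and $k = 3$, the optimal cost is 1.5 with three centers and 101.5 with two, so the condition holds for any $\varepsilon$ above roughly 0.015.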

We now show that our algorithm gives a PTAS for the $k$-means problem.

Getting a PTAS. Let $T_1, \ldots, T_k$ be the optimal clustering and $\mu_1, \ldots, \mu_k$ be the corresponding means. As before, $n_r$ denotes the size of $T_r$. Let $G$ be the set of points which satisfy the proximity condition (the good points). The above claim shows that $|G| \ge (1 - \varepsilon) n$. For simplicity, assume that exactly an $\varepsilon$ fraction of the points do not satisfy the proximity condition.

Let $S_1, \ldots, S_k$ be the clustering output by our algorithm. Let $\mu'_r$ be the mean of $S_r$. First observe that Theorem 5.5 implies that when our algorithm stops,

$$|\mu_r - \mu'_r| \le \frac{c \cdot \sqrt{k \varepsilon n}}{n_r} \|A - C\| \qquad (7)$$

for some constant $c$. For a point $A_i$, let $\alpha(A_i)$ be the square of its distance to the closest mean among $\mu_1, \ldots, \mu_k$. Define $\alpha'(A_i)$ for the solution output by our algorithm similarly.

Claim 6.11 If $A_i \notin G$, then $\alpha'(A_i) \le (1 + O(\varepsilon)) \cdot \alpha(A_i)$.

Proof. Suppose $A_i \in T_r$, and it does not satisfy the proximity condition for the pair $\mu_r, \mu_s$. Then it is easy
to see that a{Ai) > (|/x r - [i 8 \/2 - A rs ) 2 > lMr "/ s|2 > llA ^f ■ Let fi t be the closest mean to A4. Then 

a'(Ai) < \Ai - /// < (1 + e)\Ai - fi t \ 2 + Q + l) • \lH - Vt? 

< (1 + e)a{A i ) + ^ • \\A - C\\ 2 < (1 + 0(e))a(Ai) 

where the second last inequality follows from equation (7). Note that the constant in $O(\varepsilon)$ above contains terms involving $k$ and $w_{\min}$. ■
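The step in the proof of Claim 6.11 that splits $|A_i - \mu'_t|^2$ into a $(1 + \varepsilon)$ term and a $\left(\frac{1}{\varepsilon} + 1\right)$ term is an instance of a standard estimate. A short derivation is given below, where $a$ and $b$ are shorthand introduced only for this illustration, standing for $A_i - \mu_t$ and $\mu_t - \mu'_t$ respectively.

```latex
% From |\sqrt{\varepsilon}\,a - b/\sqrt{\varepsilon}|^2 \ge 0 we get
% 2\,a \cdot b \le \varepsilon |a|^2 + \tfrac{1}{\varepsilon} |b|^2, hence:
\begin{align*}
|a + b|^2 &= |a|^2 + 2\, a \cdot b + |b|^2 \\
          &\le |a|^2 + \varepsilon |a|^2 + \tfrac{1}{\varepsilon} |b|^2 + |b|^2 \\
          &= (1 + \varepsilon)\,|a|^2 + \left(1 + \tfrac{1}{\varepsilon}\right) |b|^2 .
\end{align*}
```

The same estimate with $\sqrt{\varepsilon} \cdot \beta$ in place of $\varepsilon$ gives the inequality used in the proof of Claim 6.13.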

Claim 6.12 If $A_i \in G$, but is mis-classified by our algorithm, then $\alpha'(A_i) \le (1 + O(\varepsilon)) \cdot \alpha(A_i)$.

Proof. Suppose Ai G GnT r nS s . We use the machinery developed in the proof of Theorem l5.4l Define A, u 
as in the proof of this theorem. Clearly, a' (Ai) < a(Ai) + 2|/i r — fi s \ 2 (here we have also used equation ([7])). 
But note that a(A { ) > \u\ 2 . Now, \u\ > ^rA^phi = Q ( ^^ j (again using equation ©). This implies 
the result. ■ 

Claim 6.13 For all $r$,

$$\sum_{A_i \in G \cap T_r \cap S_r} \alpha'(A_i) \le (1 + O(\sqrt{\varepsilon})) \cdot \sum_{A_i \in G \cap T_r \cap S_r} \alpha(A_i) + O(\sqrt{\varepsilon}) \cdot \sum_{A_i \notin G} \alpha(A_i).$$

Proof. Clearly,

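$$|A_i - \mu'_r|^2 \le (1 + \sqrt{\varepsilon} \cdot \beta) |A_i - \mu_r|^2 + \left(1 + \frac{1}{\sqrt{\varepsilon} \cdot \beta}\right) |\mu_r - \mu'_r|^2,$$

where $\beta$ is a large constant in terms of $k$, $\frac{1}{w_{\min}}$, $c$. Summing this over all points in $T_r \cap S_r$, we get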

£ a'(A)<(l + 0(^)) £ «W+ Vg||A " Cf||a 
AieGnTrnSr- A 4 eGnT r n5 r 

where the last inequality follows from (0) assuming /? is large enough. But now, the proof of Claim loTTTl 

I Ijt fy\ |2 

implies that J2a^g a (^i) ^ — 4~- ^° we are done. ■ 

Now summing over all $r$ in Claim 6.13 and using Claims 6.11 and 6.12 implies that our algorithm is also a PTAS.

7 Boosting 

Recall that the proximity condition requires that the distance between the means be polynomially dependent on $\frac{1}{w_{\min}}$; this could be quite poor when one of the clusters is considerably smaller than the others. In this section, we try to overcome this obstacle for a special class of distributions.

Let $F_1, \ldots, F_k$ be a mixture of distributions in $d$ dimensions. Let $A$ be the $n \times d$ matrix of samples from the distribution and $C$ be the corresponding matrix of centers. Let $D_{\min}$ denote $\min_{r, s, r \ne s} |\mu_r - \mu_s|$. Then the key property that we desire from the mixture of distributions is as follows. The following conditions should be satisfied with high probability:



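1. For all $r, s$, $r \ne s$,

$$|\mu_r - \mu_s| \ge \frac{10 k \|A - C\|}{\sqrt{n}}. \qquad (8)$$

2. For all $i$,

$$|A_i - C_i| \le D_{\min} \cdot \sqrt{d} \, n^{\alpha} \, \mathrm{polylog}(n), \qquad (9)$$

where $\alpha$ is a small enough constant (something like 0.1 will suffice).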

3. For all r,s,r ^ s, 

Y^Ui-^-vfK -\T r \ (10) 

where v is the unit vector joining \x T and \i s . This condition is essentially saying that the average 
variance of points in T r along v is bounded by ^ r ^ 3 L 

The number of samples $n$ will be a polynomial in $\frac{d}{w_{\min}}$. Recall that $D_{\min}$ denotes $\min_{r, s, r \ne s} |\mu_r - \mu_s|$. To simplify the presentation, we assume that $|\mu_r - \mu_s| \le D_{\min} \cdot \left(\frac{d}{w_{\min}}\right)^{\beta}$ for a constant $\beta$, for all pairs $r, s$. We will later show how to get rid of this assumption. We now sample two sets of $n$ points from this distribution; call these $A$ and $B$. Assume that both $A$ and $B$ satisfy the conditions (9) and (10). For all $r$, we assume that the mean of $A_i$, $i \in T_r$, is $\mu_r$ and $T_r \cap A$ has size $w_r \cdot n$. We assume the same for the points in $B$. The error caused by removing this assumption will not change our results. Let $\mu$ denote the overall mean of the points in $A$ (or $B$). Note that $\mu = \sum_r w_r \mu_r$. We translate the points so that the overall mean is 0. In other words, define a translation $f$ as $f(x) = x - \mu$. Let $A'_i$ denote $f(A_i)$. Define $B'_i$ similarly. We now define a set $X$ of $n$ points in $n$ dimensions. The point $X_i$ is defined as

$$(A'_i \cdot B'_1, \ldots, A'_i \cdot B'_n).$$

The correspondence between $X_i$ and $A_i$ naturally defines a partitioning of $X$ into $k$ clusters. Let $S_r$, $r = 1, \ldots, k$, denote these clusters. The mean $\theta_r$ of $S_r$ is

$$(C'_r \cdot B'_1, \ldots, C'_r \cdot B'_n),$$

where $C'_r = C_r - \mu$. Let $Z$ denote the matrix of means of $X$, i.e., $Z_i = \theta_r$ if $X_i \in S_r$. We now show that this process amplifies the distance between the means $\theta_r$ by a much bigger factor than $\|X - Z\|$.
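The construction of $X$, $\theta_r$ and $Z$ described above amounts to a single matrix product. The following is a minimal sketch, assuming the cluster labels of $A$ are known (which is only the case in the analysis, not for the algorithm itself) and using the empirical mean of $A$ in place of $\mu$; the function name `boost` is hypothetical.

```python
import numpy as np

def boost(A, B, labels_A, k):
    """Build X with rows X_i = (A'_i . B'_1, ..., A'_i . B'_n) and its mean matrix Z."""
    mu = A.mean(axis=0)                   # overall mean (empirical stand-in for mu)
    A_shift, B_shift = A - mu, B - mu     # the translation f(x) = x - mu
    X = A_shift @ B_shift.T               # X[i, j] = A'_i . B'_j
    C_shift = np.array([A_shift[labels_A == r].mean(axis=0) for r in range(k)])
    theta = C_shift @ B_shift.T           # theta[r] = (C'_r . B'_1, ..., C'_r . B'_n)
    Z = theta[labels_A]                   # Z_i = theta_r whenever X_i is in S_r
    return X, Z, theta
```

By linearity of the inner product, the rows of `X` belonging to cluster `r` average exactly to `theta[r]`, which is the statement that $\theta_r$ is the mean of $S_r$.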

Lemma 7.1 For all r,s,r 7^ s, 

\p, r — /x^p 

Proof. First observe that $(\mu_r - \mu_s) \cdot (\mu_r - \mu_s) = (\mu_r - \mu) \cdot (\mu_r - \mu_s) - (\mu_s - \mu) \cdot (\mu_r - \mu_s)$. So at least one of $|(\mu_r - \mu) \cdot (\mu_r - \mu_s)|$, $|(\mu_s - \mu) \cdot (\mu_r - \mu_s)|$ must be at least $\frac{|\mu_r - \mu_s|^2}{2}$. Assume without loss of generality that this is so for $|(\mu_r - \mu) \cdot (\mu_r - \mu_s)|$. Now consider the coordinates $i$ of $\theta_r - \theta_s$ corresponding to $S_r$. Such a coordinate will have value $(\mu_r - \mu_s) \cdot B'_i$. Therefore,

|^r-^| 2 > E [(^r-^)-(Si-/i)] 2 

«e5 r 



> [(Mr - Ms) • (M - Mr)] - J] [(Mr ~ Ms) • (A ~ Mr)] 

^ Pr 1 1 4 

> "T7r ' Mr — Ms 

lb 

where the last inequality follows from (10). This proves the lemma. ■

Now we bound ||X — Z\\. 
Lemma 7.2 With high probability, 

" A " Z " <i^ in -d-n 2a . f— V-polylog(n) 



n \w 



Proof. Let $Y$ denote the matrix $X - Z$, and $D$ denote the matrix $E[Y^T Y]$, where the expectation is over the choice of $A$ and $B$. We shall use Fact 6.1 to bound $\|Y\|$. Thus, we just need to bound $\max_i |Y_i|$ and $\|D\|$.

Let 7 denote L>^ in • d ■ n 2a ■ (^-) • polylog(n). 
Claim 7.3 For all i, 

\Yi\ < 

Proof. Suppose $X_i \in S_r$. Then the $j$-th coordinate of $Y_i$ is $(A_i - \mu_r) \cdot (B_j - \mu) = (A_i - \mu_r) \cdot ((B_j - \mu_{r'}) + (\mu_{r'} - \mu))$, where $r'$ is such that $B_j \in T_{r'}$. Now, condition (9) implies that $|A_i - \mu_r|, |B_j - \mu_{r'}| \le D_{\min} \cdot \sqrt{d} \, n^{\alpha} \, \mathrm{polylog}(n)$. This implies the claim. ■

Claim 7.4

$$\|D\| \le \gamma^2 n$$



Proof. We can write $Y^T Y$ as $\sum_i Y_i^T Y_i$. Let $v$ be any unit vector. Then $v^T E[Y^T Y] v = \sum_i E|Y_i \cdot v|^2$. For a fixed $i$, where $A_i \in T_r$,

$$E|Y_i \cdot v|^2 = E\Big[\Big(\sum_j v_j \, (A_i - \mu_r) \cdot (B_j - \mu)\Big)^2\Big] = \sum_j v_j^2 \, E\big[(A_i - \mu_r) \cdot (B_j - \mu)\big]^2,$$

where the last equality follows from the fact that the expectation of $B_j$ is $\mu$ and $B_j$, $B_l$ are independent if $j \ne l$. The rest of the argument is as in Claim 7.3. ■

The above two claims combined with Fact 6.1 imply the lemma. ■

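Now we pick $n$ to be at least $\left(\frac{d}{w_{\min}}\right)^{8(\beta + 1)}$. Assuming $\alpha < 0.1$, this implies (using the above two lemmas) that for all $r, s$, $r \ne s$,

$$|\theta_r - \theta_s| \ge \frac{\|X - Z\|}{\sqrt{n}} \cdot \left(\frac{d}{w_{\min}}\right)^{4\beta}. \qquad (11)$$
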

We now run the first step of the algorithm Cluster on $X$. We claim that the clustering obtained after the first step has very few classification errors. Let $\phi_r$, $r = 1, \ldots, k$, be the $k$ centers output by the first step of the algorithm Cluster. Lemma 5.1 implies that for each center $\theta_r$, there exists a center $\phi_r$ satisfying

II V ^71 I 

\9 r ~ 4>r\ < 20Vk- 



Order the centers $\phi_r$ such that $\phi_r$ is closest to $\theta_r$; equation (11) implies that the closest estimated centers to different $\theta_r$ are distinct. It also follows that for $r \ne s$

1 \\x-z\\ ( d 



^-^'^■^•Uri (12) 



Lemma 7.5 The number of points in $S_r$ which are not assigned to $\phi_r$ after the first step of the algorithm is at most $\left(\frac{w_{\min}}{d}\right)^{2\beta} \cdot n$.

Proof. We use the notation in Step 1 of algorithm Cluster. Suppose the statement of the lemma is not true. Then, in the $k$-means solution, at least $\left(\frac{w_{\min}}{d}\right)^{2\beta} \cdot n$ points $X_i$, $i \in S_r$, are assigned to a center at least

1 . Iix 



-Zll / rl \ " 

=^ ■ — — distance away (using equation [121) . But then the square of A;-means clustering cost is 



2 y/fi 

much larger than $k \cdot \|X - Z\|^2$. ■

Now, we use the clustering given by the centers $\phi_r$ to partition the original set of points $A$; thus we have a clustering of these points where the number of mis-classified points from any cluster $T_r$ is at most $\left(\frac{w_{\min}}{d}\right)^{2\beta} \cdot n$. Let $S_r$ denote this clustering, where $S_r$ corresponds to $T_r$. Let $\nu_r$ denote the center of $S_r$. We now argue that $|\nu_r - \mu_r|$ is very small.

Lemma 7.6 For every s, \ v s — p, s \ < ^ A ^^ ■ 
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Proof. We use arguments similar to proof of Theorem 15.51 Let n rs , p rs denote the number and mean 
respectively of T r n S s . Similarly, define n ss and p ss as the size and mean of the points in T s n S s . We 
know that 

l^al^s — ^ssf^ss ~t~ ^ , ^rsf^rs- 



Theorem \5A\ implies that 
and Corollary 15.31 implies that 



loo-nA-di 

I Mrs Ms I — 



I I ^ VPs — ™ss I, . n \\ 
|Mss ~ Ma| < 11^ - C||. 



Now, proceeding as in the proof of Theorem [531 we get 



I i i i X — ^ W'VS i 

|Ms-^s| < ttttIMss - Mai + [cT ~~ A*s 

(Jo , 

I si I SI 



< 400-IIA-CH 

l-^s I * * 



\J n r 

r^s 



Using Lemma 1731 now implies the result. 



Starting from the centers $\nu_r$, we run the second step of algorithm Cluster. Then, we have the analogue of Theorem 2.2 in this setting.

Theorem 7.7 Suppose a mixture of distributions satisfies the conditions (8)-(10) above and at least a $(1 - \varepsilon)$ fraction of the sampled points satisfy the proximity condition. Then we can correctly classify all but an $O(k^2 \varepsilon)$ fraction of the points.

We now remove the assumption that $|\mu_r - \mu_s| \le D_{\min} \cdot \left(\frac{d}{w_{\min}}\right)^{\beta}$; let $\gamma$ denote the latter quantity. We construct a graph $G = (V, E)$ as follows: $V$ is the set of points $A \cup B$, and we join two points by an edge if the distance between them is at most $\gamma / k$. First observe that if $i, j \in T_r$, then they will be joined by an edge provided the following condition holds (using condition (9)):

Anin • \/rfn a polylog(n) < D min ■ (-^—) , 

and the same for $A$ replaced by $B$ above. This would hold if $\alpha < 0.1$ (recall that $n$ is roughly $\left(\frac{d}{w_{\min}}\right)^{8(\beta + 1)}$). Now consider the connected components of this graph. In each connected component, any two vertices are joined by a path of length at most $k$ (because any two vertices from the same cluster $T_r$ have an edge between them). So the distance between any two vertices from the same component is at most $\gamma$. Therefore the distance between the means of two clusters in the same component is at most $\gamma$. Now, we can apply the arguments of this section to each component independently. This would, however, require us to know the number of clusters in each component of this graph. If we treat $k$ as a constant, this is only a constant number of choices. A better way is to modify the definition of $X$ as follows: consider a point $A_i$. Let $\mu$ denote the mean of the points in the same component as $A_i$ in the graph $G$. Then $X_{ij} = (A_i - \mu) \cdot (B_j - \mu)$ if $A_i, B_j$ belong to the same component of $G$, and $L$ otherwise, where $L$ is a large quantity. Now note that $\theta_r - \theta_s$ will still satisfy the statement of Lemma 7.1, because if they are from the same component in $G$, then it follows from the lemma, otherwise the distance between them is at least $L$. But Lemma 7.2 continues to hold without any change, and so the rest of the arguments follow as they are.
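The graph construction and the modified definition of $X$ can be sketched as follows. This is an illustration under assumptions: the threshold (the role played by $\gamma / k$) and the large value `big` (the role played by $L$) are caller-supplied, the component mean is taken over the points of both $A$ and $B$ in that component, and the function names are hypothetical.

```python
import numpy as np

def components(points, threshold):
    """Connected components of the graph joining points at distance <= threshold."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    for i, j in zip(*np.where(np.triu(dists <= threshold, k=1))):
        parent[find(int(i))] = find(int(j))
    roots = [find(i) for i in range(n)]
    relabel = {root: c for c, root in enumerate(dict.fromkeys(roots))}
    return np.array([relabel[root] for root in roots])

def boosted_matrix(A, B, comp_A, comp_B, big=1e6):
    """X[i, j] = (A_i - mu) . (B_j - mu) within a component, `big` across components."""
    X = np.full((len(A), len(B)), big)
    for c in np.unique(comp_A):
        in_a, in_b = comp_A == c, comp_B == c
        mu = np.vstack([A[in_a], B[in_b]]).mean(axis=0)   # mean of the component
        X[np.ix_(in_a, in_b)] = (A[in_a] - mu) @ (B[in_b] - mu).T
    return X
```

A typical use is to call `components` on the stacked array of $A$ and $B$, split the returned labels back into the two samples, and pass them to `boosted_matrix`.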
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7.1 Applications 

We now give some applications of Theorem |7.7j 



7.1.1 Learning Gaussian Distributions 

Suppose we are given a mixture of Gaussian distributions $F_1, \ldots, F_k$. Suppose the means satisfy the following separation condition for all $r, s$, $r \ne s$:

$$|\mu_r - \mu_s| \ge \Omega\left(\sigma k \cdot \log \frac{d}{w_{\min}}\right),$$



where a denotes the maximum variance in any direction of the Gaussian distributions. Sample a set of 
n = poly (^p - ) points. It is easy to check using Fact l6.1l that \\A — C|| is 0(aVd • log re). It is also easy 
to check that condition (fTOb is satisfied with a = 0. Therefore, Theorem 17.71 implies the following. 

Lemma 7.8 Given a mixture of $k$ Gaussians satisfying the separation condition above, we can correctly classify a set of $n$ samples, where $n = \mathrm{poly}\left(\frac{d}{w_{\min}}\right)$.

7.1.2 Learning Power Law Distributions 

Consider a mixture of distributions $F_1, \ldots, F_k$ where each of the distributions $F_r$ satisfies the following condition for every unit vector $v$:

$$P_{X \in F_r}\left[\, |(X - \mu_r) \cdot v| > \sigma t \,\right] \le \frac{1}{t^{\gamma}} \qquad (13)$$

where $\gamma > 2$ is a large enough constant. Let $A$ be a set of $n$ samples from the mixture. Suppose the means satisfy the following separation condition for every $r, s$, $r \ne s$:

\fJ>r ~ Ms I >^{crk - (log — h —3- ] ] . 

V V Wmin e~ ) ) 

First observe that this is a special class of the distributions considered in Section 6.3. So, one can again prove that $\frac{\|A - C\|}{\sqrt{n}}$ is $O(\sigma \cdot \sqrt{d} \cdot \mathrm{polylog}(n))$. This is off from condition (8) by a factor of $\sqrt{d}$. But for large enough $n$, inequality (11) will continue to hold. Now let us try to bound $\max_i |A_i - C_i|$.

Claim 7.9 With high probability, 

max|^-a|<^ min .v^-n^p lylog(n). 

i 

1 

Proof. Let e\, . . . , be orthonormal basis for the space. Then \ (A{ — d) ■ e.\\ < a(nd)i ■ log(n) for all i 
with high probability. So, with high probability, for all i, 

I A - CA < D min Vdn~. 



Finally, we verify condition (10). Let $v$ be the unit vector joining $\mu_r$ and $\mu_s$. Then $E\left[((A_i - C_i) \cdot v)^2\right]$ is $O(\sigma^2)$ provided $\gamma > 2$. Now summing over all $A_i \in T_r$ and taking a union bound over all $k^2$ choices for $v$ proves that condition (10) is also satisfied. It is also easy to check that at least a $1 - \varepsilon$ fraction of the points satisfy the proximity condition. So we have

Theorem 7.10 Given a mixture of distributions where each distribution satisfies (13), we can cluster at least a $1 - \varepsilon$ fraction of the points.